
Step 2: Joining Multiple Geospatial Datasets

After creating all 3 of our datasets at H3 resolution 5, we can now join them together.

Joining logic

Here's a quick summary of what each of our datasets contains:

Crop

Dataset size: ~700k rows
hex                 crop_type  area
599658332378103807  4          1713
599669866512777215  50         20854
599657528145477631  122        17505
599699078732840959  121        10368586
599697034328408063  122        17754

Elevation

Dataset size: ~68k rows

hex                 mean_value
599647257804931071  453.75
599647264247382015  375.32
599647267468607487  392.52
599647276058542079  401.93
599647280353509375  410.52

Temperature

Dataset size: ~68k rows

hex                 daily_mean
599740827962441727  16.46
599740794676445183  16.51
599740824741216255  16.40
599740837626118143  14.99
599740839773601791  14.93

Notice a few things:

  • We have ~10x more rows in the Crop dataset than in the Elevation and Temperature datasets, which both have the same number of H3 cells.
  • The Elevation & Temperature datasets each map 1 H3 cell <-> 1 value. The Crop dataset has multiple rows per H3 cell.

We can write a simple UDF that returns all three datasets at once for a single H3 cell to see this:

@fused.udf
def udf():
    # Assuming you have named your UDFs as follows:
    elevation_data = fused.run('elevation_hex_mean_height')
    cdl_data = fused.run("cdl_data_all_crops_staging")
    temp_data = fused.run("era5_hex_mean_temp_foss4g")

    # Random H3 cell within the US
    random_hex = 599691796615790591

    # Filter all 3 dataframes by the same hex value
    filtered_elevation = elevation_data[elevation_data['hex'] == random_hex]
    filtered_cdl = cdl_data[cdl_data['hex'] == random_hex]
    filtered_temp = temp_data[temp_data['hex'] == random_hex]

    # Print the contents of all 3 filtered dataframes
    print("Elevation Data:")
    print(filtered_elevation)
    print("\nCDL Data:")
    print(filtered_cdl)
    print("\nTemperature Data:")
    print(filtered_temp)

    return filtered_elevation

This returns:

Elevation Data:
hex mean_value
46561 599691796615790591 307.809524

CDL Data:
crop_type hex area chunk_id year
355599 122 599691796615790591 7631733 577199624117288959 2024
355600 123 599691796615790591 2211934 577199624117288959 2024
355601 68 599691796615790591 18305536 577199624117288959 2024
355602 37 599691796615790591 5207407 577199624117288959 2024
355603 131 599691796615790591 60333 577199624117288959 2024
... ... ... ... ... ...
365701 6 599691796615790591 10317 577199624117288959 2024
365702 143 599691796615790591 12347 577199624117288959 2024
365703 29 599691796615790591 14997 577199624117288959 2024
365704 221 599691796615790591 15215 577199624117288959 2024
365705 38 599691796615790591 16505 577199624117288959 2024

[68 rows x 5 columns]

Temperature Data:
hex daily_mean
41582 599691796615790591 15.31
  • 1 H3 cell <-> 1 value for Elevation and Temperature
  • 1 H3 cell <-> 68 values for Crop!

This is because the Crop dataset creates a new row for each crop type. Recall that a resolution 5 H3 cell is 252 km², so a single cell can contain multiple crop types.

To join our datasets we need to decide on a join logic. We'll:

Aggregate the Crop dataset by the dominant crop type for each H3 cell.
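As a toy illustration of that logic (plain pandas with made-up values, not the real CDL data), keeping only the crop with the largest total area per hex looks like this:

```python
import pandas as pd

# Made-up stand-in for the Crop dataset: several crop rows per hex cell
cdl = pd.DataFrame({
    "hex": [1, 1, 1, 2, 2],
    "crop_type": [122, 68, 37, 1, 5],
    "area": [7_631_733, 18_305_536, 5_207_407, 1_000, 9_000],
})

# Sum area per (hex, crop_type), then keep the dominant crop per hex
dominant = (
    cdl.groupby(["hex", "crop_type"], as_index=False)["area"].sum()
       .sort_values("area", ascending=False)
       .drop_duplicates("hex")
       .rename(columns={"crop_type": "crop_rank_1", "area": "crop_area"})
       .sort_values("hex")
)
print(dominant)  # one row per hex: crop 68 for hex 1, crop 5 for hex 2
```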

Aggregating the Crop dataset by most common crop

@fused.udf
def udf():
    cdl_data = fused.run("cdl_data_all_crops_staging")

    # Aggregate CDL by most common crop by area. 1 H3 cell <-> 1 row: crop_rank_1 and crop_area
    common = fused.load("https://github.com/fusedio/udfs/tree/56ec615/public/common/")
    con = common.duckdb_connect()

    # Group by hex cell, rank crops by area, keep the top crop per cell
    crop_agg = con.execute("""
        SELECT
            hex,
            crop_type AS crop_rank_1,
            SUM(area) AS crop_area,
            year
        FROM cdl_data
        GROUP BY hex, crop_type, year
        QUALIFY ROW_NUMBER() OVER (PARTITION BY hex, year ORDER BY SUM(area) DESC) = 1
    """).df()

    print(f"{crop_agg.T=}")

    return crop_agg

We can refine this further by:

  • Filtering out any crops we're not interested in, like shrubs, grass, forest, etc.
  • Taking neighboring cells into account using a k-ring to get more accurate statistics
@fused.udf
def udf(k_ring_size: int = 1):
    cdl_data = fused.run("cdl_data_all_crops_staging")

    # Aggregate CDL by most common crop by area. 1 H3 cell <-> 1 row: crop_rank_1 and crop_area
    common = fused.load("https://github.com/fusedio/udfs/tree/56ec615/public/common/")
    con = common.duckdb_connect()

    # Removing 'crop types' from the CDL dataset we're not interested in
    crops_to_remove = [
        64,   # Shrubland
        152,  # Shrubland
        176,  # Pasture/Hay
        63,   # Forest
        142,  # Forest
        141,  # Forest
        190,  # Woody Wetlands
        111,  # Perennial Ice/Snow
        195,  # Water
        131,  # Barren
        121,  # Developed/Open Space
        143,  # Mixed Forest
    ]
    cdl_data = cdl_data[~cdl_data['crop_type'].isin(crops_to_remove)]

    crop_agg = con.execute("""
        WITH k_ring_expanded AS (
            SELECT
                hex AS center_hex,
                unnest(h3_grid_disk(hex, ?)) AS ring_hex
            FROM cdl_data
        ),
        k_ring_data AS (
            SELECT
                k.center_hex,
                c.crop_type AS crop_type,
                c.area AS crop_area
            FROM k_ring_expanded k
            JOIN cdl_data c ON CAST(c.hex AS BIGINT) = CAST(k.ring_hex AS BIGINT)
        ),
        aggregated_crops AS (
            SELECT
                center_hex,
                crop_type,
                SUM(crop_area) AS total_area
            FROM k_ring_data
            GROUP BY center_hex, crop_type
        ),
        ranked_crops AS (
            SELECT
                center_hex,
                crop_type,
                total_area,
                ROW_NUMBER() OVER (PARTITION BY center_hex ORDER BY total_area DESC) AS rank
            FROM aggregated_crops
        )
        SELECT
            center_hex AS hex,
            crop_type AS crop_rank_1,
            total_area AS area_rank_1
        FROM ranked_crops
        WHERE rank = 1
    """, [k_ring_size]).df()

    print(f"{crop_agg.T=}")

    return crop_agg

This returns:

hex                 crop_rank_1  area_rank_1
599230208964296703  0            8.98e+08
599238790308954111  0            2.03e+09
...                 ...          ...
599238794603921407  0            2.02e+09
599238816078757887  0            2.02e+09

Shape: (51299, 3)

Joining datasets

We now have 3 datasets, each with unique hex values, so we can join them together:

@fused.udf
def udf(k_ring_size: int = 1):
    elevation_data = fused.run('elevation_hex_mean_height')
    cdl_data = fused.run("cdl_data_all_crops_staging")
    temp_data = fused.run("era5_hex_mean_temp_foss4g")

    # Aggregate CDL by most common crop by area. 1 H3 cell <-> 1 row: crop_rank_1 and crop_area
    common = fused.load("https://github.com/fusedio/udfs/tree/56ec615/public/common/")
    con = common.duckdb_connect()

    # Removing 'crop types' from the CDL dataset we're not interested in
    crops_to_remove = [
        64,   # Shrubland
        152,  # Shrubland
        176,  # Pasture/Hay
        63,   # Forest
        142,  # Forest
        141,  # Forest
        190,  # Woody Wetlands
        111,  # Perennial Ice/Snow
        195,  # Water
        131,  # Barren
        121,  # Developed/Open Space
        143,  # Mixed Forest
    ]
    cdl_data = cdl_data[~cdl_data['crop_type'].isin(crops_to_remove)]

    crop_agg = con.execute("""
        WITH k_ring_expanded AS (
            SELECT
                hex AS center_hex,
                unnest(h3_grid_disk(hex, ?)) AS ring_hex
            FROM cdl_data
        ),
        k_ring_data AS (
            SELECT
                k.center_hex,
                c.crop_type AS crop_type,
                c.area AS crop_area
            FROM k_ring_expanded k
            JOIN cdl_data c ON CAST(c.hex AS BIGINT) = CAST(k.ring_hex AS BIGINT)
        ),
        aggregated_crops AS (
            SELECT
                center_hex,
                crop_type,
                SUM(crop_area) AS total_area
            FROM k_ring_data
            GROUP BY center_hex, crop_type
        ),
        ranked_crops AS (
            SELECT
                center_hex,
                crop_type,
                total_area,
                ROW_NUMBER() OVER (PARTITION BY center_hex ORDER BY total_area DESC) AS rank
            FROM aggregated_crops
        )
        SELECT
            center_hex AS hex,
            crop_type AS crop_rank_1,
            total_area AS area_rank_1
        FROM ranked_crops
        WHERE rank = 1
    """, [k_ring_size]).df()

    # Join elevation with aggregated CDL crop data
    merged = elevation_data.merge(
        crop_agg,
        on='hex',
        how='left'
    ).rename(columns={'mean_value': 'elevation'})

    # Join with ERA5 temperature data
    merged = merged.merge(
        temp_data,
        on='hex',
        how='left'
    )

    # Round elevation to 1 decimal place
    merged['elevation'] = merged['elevation'].round(1)

    return merged

Returns:

hex                 elevation  crop_rank_1  area_rank_1   daily_mean
599650153686630399  357.4      1.0          9.152000e+09  13.90
599650161202823167  413.5      1.0          1.124163e+10  13.45
...                 ...        ...          ...           ...
599747753597206527  255.9      0.0          2.030604e+09  8.93
599747766482108415  283.6      0.0          2.031832e+09  8.15

Shape: (68044, 5)
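Because each dataset now has one row per hex, the left joins can't duplicate rows. A small pandas sketch (toy values, not our real datasets) showing how the merge's validate= and indicator= options can confirm this assumption:

```python
import pandas as pd

left = pd.DataFrame({"hex": [1, 2, 3], "elevation": [307.8, 255.9, 283.6]})
right = pd.DataFrame({"hex": [1, 3], "daily_mean": [15.31, 8.15]})

# validate= raises MergeError if either side has duplicate hex values;
# indicator= flags rows that found no match on the right
merged = left.merge(right, on="hex", how="left",
                    validate="one_to_one", indicator=True)
print(merged)
```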

Using fused common functions for simpler H3 operations

Note that we're using the Fused common functions to simplify DuckDB queries that use H3 functions:

# inside our udf
common = fused.load("https://github.com/fusedio/udfs/tree/56ec615/public/common/")
con = common.duckdb_connect()
crop_agg = con.execute("""SELECT ...""").df()
# rest of logic

This helper provides a few benefits:

  • Pre-installed dependencies for H3 operations
  • No need to register data as tables in DuckDB as a preliminary step

Summary & Next Steps

We now have a joined dataset that contains, for each H3 res 5 cell:

  • Elevation data
  • Crop data by most common crop + area of said crop
  • Temperature data

Next: